feat: improve maintainers detection [CM-1033] by mbani01 · Pull Request #3908 · linuxfoundation/crowd.dev

mbani01 · 2026-03-10T15:41:38Z

What changed

Before

File discovery was a sequential scan of a hard-coded flat list (MAINTAINER_FILES: 13 entries, root-only, no recursion).
The first matching file was used — no ranking, no scoring, no fallback strategy.
README.md was in the candidate list and required a simple content check for the word "maintainer".
AI file-selection received a plain list of filenames with no signals to rank them.
extract_maintainers always started from scratch — no reuse of a previously found file.
compare_and_update_maintainers skipped all maintainers with github_username == "unknown", including those with a valid email; no email fallback for identity lookup.
candidate_files and ai_suggested_file did not exist in MaintainerResult or execution metrics.
The full-content AI extraction prompt was always built upfront, even when the content was going to be chunked.

After

Detection pipeline (4-step with fallback)

1. Saved file reuse: if a maintainer file was found on a previous run, it is tried first, before any scanning.
2. Ripgrep recursive search + scoring: rg scans the full repo for files matching 20 governance stems (MAINTAINERS, OWNERS, CODEOWNERS, GOVERNANCE, EMERITUS, etc.) across all depths and valid extensions. Each file is scored: exact known path (100), exact stem match (50), partial stem (25), plus 1 per governance keyword found in content. All candidates are returned sorted by score; the top one is analyzed.
3. README guard: README files are rejected immediately (no AI call) unless their content contains the word maintainer.
4. AI file-selection fallback: if the top candidate fails, the full repo file list is scanned, pre-filtered to governance-scored files (capped at 300) with the already-failed file excluded, and passed to AI as (filename, score) tuples. The prompt instructs the model to prefer higher scores, shallower paths, and to reject files inside vendor/, node_modules/, third_party/, external/, and similar third-party directories.
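The ripgrep scoring step can be sketched roughly as below. This is a minimal illustration, not the PR's code: the function names, the trimmed constant sets, the reading of "partial stem" as a substring match, and the assumption that the 100/50/25 tiers are mutually exclusive are all assumptions. Only the weights themselves come from the description above.

```python
import os

# Trimmed, illustrative subsets; the sets in the PR are larger.
KNOWN_PATHS = {"maintainers.md", "codeowners", ".github/codeowners", "docs/maintainers.md"}
GOVERNANCE_STEMS = {"maintainers", "owners", "codeowners", "governance", "emeritus"}
SCORING_KEYWORDS = ["maintainer", "codeowner", "owner", "governance"]


def score_candidate(path: str, content: str = "") -> int:
    """Score a repo-relative path: known path 100, exact stem 50, partial stem 25, +1 per keyword."""
    lowered = path.lower()
    stem = os.path.splitext(os.path.basename(lowered))[0]
    if lowered in KNOWN_PATHS:
        score = 100
    elif stem in GOVERNANCE_STEMS:
        score = 50
    elif any(s in stem for s in GOVERNANCE_STEMS):  # assumption: "partial" means substring
        score = 25
    else:
        score = 0
    text = content.lower()
    # One point per distinct keyword present in the content.
    score += sum(1 for kw in SCORING_KEYWORDS if kw in text)
    return score


def rank_candidates(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, score) pairs with a positive score, highest score first."""
    scored = [(p, score_candidate(p, c)) for p, c in files.items()]
    return sorted([(p, s) for p, s in scored if s > 0], key=lambda item: item[1], reverse=True)
```

Returning `(path, score)` tuples keeps the output in the same shape that the description says is later handed to the AI fallback.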

Bug fixes

compare_and_update_maintainers: the skip guard now only fires when both github_username and email are unknown/None (previously skipped all "unknown" usernames unconditionally). New maintainers identified by email now go through find_maintainer_identity_by_email as a fallback, matching insert_new_maintainers behaviour.
Extraction prompt for chunked content is now built lazily inside the else branch, avoiding a wasted string allocation on every large file.
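The corrected skip guard and the email fallback can be illustrated with a small sketch. The `Maintainer` dataclass and `lookup_strategy` helper are hypothetical stand-ins; only the guard condition mirrors the diff in this PR, and the actual lookup via `find_maintainer_identity_by_email` is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Maintainer:  # hypothetical stand-in for the service's maintainer item
    github_username: str
    email: Optional[str] = None


def should_skip(m: Maintainer) -> bool:
    """Skip only when there is neither a usable username nor a usable email."""
    return m.github_username == "unknown" and m.email in ("unknown", None)


def lookup_strategy(m: Maintainer) -> str:
    """Name the identity lookup that would be attempted for this maintainer."""
    if should_skip(m):
        return "skip"
    if m.github_username != "unknown":
        return "username"
    # The PR describes calling find_maintainer_identity_by_email at this point.
    return "email"
```

Under the previous guard, the second and third cases below would both have been skipped; with the new one, only the entry with no email is.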

Observability

MaintainerResult gains candidate_files: list[tuple[str, int]] and ai_suggested_file: str | None.
ServiceExecution metrics now record candidate_files (top-100 by score) and ai_suggested_file on every run.
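As a rough picture of the new metadata, the sketch below models the two fields and the top-100 truncation. `MaintainerResultSketch`, `build_metrics`, and the constant name are hypothetical; `Optional[str]` is used as the equivalent of `str | None`, and the real `MaintainerResult` carries other fields not shown here.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_CANDIDATES_IN_METRICS = 100  # hypothetical name; the top-100 cap comes from the PR text


@dataclass
class MaintainerResultSketch:
    # (path, score) pairs produced by candidate discovery
    candidate_files: list[tuple[str, int]] = field(default_factory=list)
    # file chosen by the AI fallback, if that step ran
    ai_suggested_file: Optional[str] = None


def build_metrics(result: MaintainerResultSketch) -> dict:
    """Keep only the highest-scoring candidates before persisting them."""
    top = sorted(result.candidate_files, key=lambda c: c[1], reverse=True)
    return {
        "candidate_files": top[:MAX_CANDIDATES_IN_METRICS],
        "ai_suggested_file": result.ai_suggested_file,
    }
```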

Note

Medium Risk
Refactors the maintainer extraction pipeline to rely on recursive ripgrep-based discovery, scoring, and AI fallback, which can change which governance file is selected and affects runtime behavior/cost. Also adds a new runtime dependency (ripgrep) and persists more execution metadata, so failures or environment mismatches could impact maintainer processing.

Overview
Improves maintainer extraction by replacing the prior hard-coded maintainer filename scan with a multi-step pipeline: reuse the previously saved maintainer file when available, otherwise perform recursive ripgrep-based candidate discovery with filename/content scoring, then fall back to an AI-driven file picker fed with scored candidates.

Adds guards and fixes around maintainer ingestion: README candidates are rejected unless they mention maintainer, unknown usernames are only skipped when email is also unknown, and identity lookup now falls back to find_maintainer_identity_by_email when github_username is unknown. Extends MaintainerResult and ServiceExecution metrics to record top candidate_files and the ai_suggested_file, and updates the git-integration Docker image to include ripgrep.
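The fallback order described in this overview can be sketched as a small driver function. Everything here is an assumption made for illustration: `pick_maintainer_file`, the `analyze` and `ai_pick` callables, and the simplification that the AI step only sees already-scored candidates. The README guard is assumed to live inside `analyze`.

```python
from typing import Callable, Optional

MAX_AI_FILE_LIST_SIZE = 300  # cap named in the PR diff


def pick_maintainer_file(
    saved_file: Optional[str],
    candidates: list[tuple[str, int]],
    analyze: Callable[[str], bool],
    ai_pick: Callable[[list[tuple[str, int]]], Optional[str]],
) -> Optional[str]:
    """Return the first file that analyzes successfully, trying cheaper options first."""
    failed: list[str] = []

    # Reuse the file saved by a previous run, if any.
    if saved_file:
        if analyze(saved_file):
            return saved_file
        failed.append(saved_file)

    # Analyze the top-scored candidate (candidates are sorted by score, descending).
    top = next((p for p, _ in candidates if p not in failed), None)
    if top is not None:
        if analyze(top):
            return top
        failed.append(top)

    # Last resort: ask the AI picker, excluding files that already failed.
    remaining = [(p, s) for p, s in candidates if p not in failed][:MAX_AI_FILE_LIST_SIZE]
    suggestion = ai_pick(remaining) if remaining else None
    if suggestion is not None and analyze(suggestion):
        return suggestion
    return None
```

Passing the analysis and AI steps in as callables keeps the ordering logic testable without a model call.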

Written by Cursor Bugbot for commit b38abdc. This will update automatically on new commits.

CLAassistant · 2026-03-10T15:41:56Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Copilot

Pull request overview

This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.

Changes:

Added ripgrep-based repo scanning (rg --files and keyword search) with fallback to os.walk, plus scoring/filtering of dynamic candidates.
Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
Extended MaintainerResult and service execution metrics to include candidate_files and ai_suggested_file; added ripgrep to the Docker image.
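A file listing with an `os.walk` fallback, as mentioned in the first item above, might look like the following. This is a sketch under assumptions: `list_repo_files` and its `rg_path` parameter are hypothetical, and only the `rg --files` invocation and the `./` prefix stripping are taken from this PR's text and diff.

```python
import os
import subprocess
from typing import Optional


def list_repo_files(repo_path: str, rg_path: Optional[str] = None) -> list[str]:
    """List repo-relative file paths with `rg --files`, or os.walk when rg is unavailable or fails."""
    if rg_path:
        try:
            proc = subprocess.run(
                [rg_path, "--files"],
                cwd=repo_path,
                capture_output=True,
                text=True,
                timeout=60,
                check=True,
            )
            return sorted(
                line[2:] if line.startswith("./") else line
                for line in proc.stdout.splitlines()
                if line.strip()
            )
        except (OSError, subprocess.SubprocessError):
            pass  # fall through to the pure-Python walk

    files = []
    for root, dirs, names in os.walk(repo_path):
        dirs[:] = [d for d in dirs if d != ".git"]  # never descend into git metadata
        for name in names:
            files.append(os.path.relpath(os.path.join(root, name), repo_path))
    return sorted(files)
```

A caller could pass `shutil.which("rg")` as `rg_path`. Note that the two branches are not equivalent: ripgrep honours ignore files and skips hidden files by default, while the walk above only skips `.git`.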

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py` | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| `services/apps/git_integration/src/crowdgit/models/maintainer_info.py` | Adds new result metadata fields (`candidate_files`, `ai_suggested_file`). |
| `scripts/services/docker/Dockerfile.git_integration` | Installs `ripgrep` in the runner image to support dynamic search. |


mbani01 · 2026-03-11T14:49:32Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+    KNOWN_PATHS = {
+        "maintainers",
+        "maintainers.md",
+        "maintainer.md",
+        "codeowners",
+        "codeowners.md",
+        "contributors",
+        "contributors.md",
+        "owners",
+        "owners.md",
+        "authors",
+        "authors.md",
+        "governance.md",
+        "docs/maintainers.md",
+        ".github/maintainers.md",
+        ".github/contributors.md",
+        ".github/codeowners",
+    }
+
+    # Governance stems (basename without extension, lowercased) for filename search
+    GOVERNANCE_STEMS = {
+        "maintainers",
+        "maintainer",
+        "codeowners",
+        "codeowner",
+        "contributors",
+        "contributor",
+        "owners",
+        "owners_aliases",
+        "authors",
+        "committers",
+        "commiters",
+        "reviewers",
+        "approvers",
+        "administrators",
+        "stewards",
+        "credits",
+        "governance",
+        "core_team",
+        "code_owners",
+        "emeritus",
+    }
+
+    VALID_EXTENSIONS = {
+        "",
+        ".md",
+        ".markdown",
+        ".txt",
+        ".rst",
+        ".yaml",
+        ".yml",
+        ".toml",
+        ".adoc",
+        ".csv",
+        ".rdoc",
+    }
+
+    SCORING_KEYWORDS = [
+        "maintainer",
+        "codeowner",
+        "owner",
+        "contributor",
+        "governance",
+        "steward",
+        "emeritus",
+        "approver",
+        "reviewer",
    ]

+    EXCLUDED_FILENAMES = {
+        "contributing.md",
+        "contributing",
+        "code_of_conduct.md",
+        "code-of-conduct.md",
+    }


Those were mainly inferred from our processing history.

joanagmaia

This looks great. I have a couple of questions and requests, to make sure that we have some more metrics given that these are big changes to the current process.

Questions:

With the mechanism of only picking one file for analysis, we are assuming that all maintainer information will be in a single file, right? I'm not sure whether we should make sure that we won't lose data because of it.

Requests:

Can we run the new mechanism on about 10 repos and check the accuracy? I would even run it against the current issues we have open on Insights as well, to see if we have improved coverage: https://github.com/linuxfoundation/insights/issues?q=is%3Aissue%20state%3Aopen%20maintainer
Can we prepare a monitor in metaplane that covers the number of repositories for which we can get maintainers data? And also the number of projects?
Can we test using the Haiku model for find_maintainer_file_with_ai, since it would be a simpler task than the rest of the work?

joanagmaia · 2026-03-11T16:10:56Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+    MAX_AI_FILE_LIST_SIZE = 300
+
+    # Full paths that get the highest score bonus when matched exactly
+    KNOWN_PATHS = {


We should also include SECURITY-INSIGHTS.md. It was supported before as well.
E.g. https://github.com/open-telemetry/opentelemetry-dotnet/blob/d54379e28c07db783452a33e119f1cdf8e7d96a6/SECURITY-INSIGHTS.yml#L13

joanagmaia · 2026-03-11T16:41:21Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+    }
+
+    # Governance stems (basename without extension, lowercased) for filename search
+    GOVERNANCE_STEMS = {


Should we also add:

workgroup (e.g. https://github.com/open-feature/community/blob/d2f54702a4bca67cd7781a8fed91e9809ecc4a0a/config/open-feature/sdk-ruby/workgroup.yaml#L15)

My only concern here is that they seem to use the community repo to manage some maintainers data, so we might need to infer the repository from the directory structure. Maybe it's too complex for us to support, at least for now.

mbani01 · 2026-03-11

It's tricky when the repo and its maintainers data are in different places; will check how we can support this easily.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.


cursor · 2026-03-11T17:24:12Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

            role = maintainer.normalized_title
            original_role = self.make_role(maintainer.title)
-            if github_username == "unknown":
+            if github_username == "unknown" and maintainer.email in ("unknown", None):


Dict dedup silently drops "unknown" username maintainers

High Severity

new_maintainers_dict is built with {m.github_username: m for m in maintainers}, so when multiple maintainers have github_username="unknown", only the last one survives. Previously all "unknown" entries were unconditionally skipped, so the dedup was harmless. Now that the skip guard allows "unknown" usernames with valid emails through for email-based identity lookup, all but the last "unknown" maintainer are silently dropped before processing. Contrast with insert_new_maintainers, which iterates the list directly and processes every entry.

Additional Locations (1)

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py#L193-L204
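The collision described in this finding is easy to reproduce in isolation. In the sketch below, `Maintainer` is a hypothetical stand-in, `index_by_username` mirrors the dict comprehension quoted in the comment, and `index_by_identity` is one possible remedy offered as an assumption, not the fix the PR adopted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Maintainer:  # hypothetical stand-in for the service's maintainer item
    github_username: str
    email: Optional[str] = None


def index_by_username(maintainers: list[Maintainer]) -> dict:
    """Shape of the original index: later 'unknown' entries overwrite earlier ones."""
    return {m.github_username: m for m in maintainers}


def index_by_identity(maintainers: list[Maintainer]) -> dict:
    """Key by username when known, else by email, so 'unknown' entries do not collide."""
    indexed = {}
    for m in maintainers:
        if m.github_username != "unknown":
            key = ("username", m.github_username)
        elif m.email not in ("unknown", None):
            key = ("email", m.email.lower())
        else:
            continue  # nothing to identify this entry by
        indexed[key] = m
    return indexed
```

With two email-only maintainers in the input, the first index keeps only the last of them, while the second keeps both.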

cursor · 2026-03-11T17:24:13Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+        "docs/maintainers.md",
+        ".github/maintainers.md",
+        ".github/contributors.md",
+        ".github/codeowners",


Case mismatch prevents SECURITY-INSIGHTS.md from matching

Medium Severity

KNOWN_PATHS contains "SECURITY-INSIGHTS.md" in uppercase, but _score_filename lowercases candidate_path before checking membership. The lowercased "security-insights.md" will never match the uppercase entry. Additionally, no GOVERNANCE_STEMS entry matches "security-insights", so _ripgrep_search won't discover the file either. This file type is effectively unsupported despite being listed.

Additional Locations (1)

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py#L486-L497
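The mismatch comes down to comparing a lowercased path against a set that still contains an uppercase entry. A minimal reproduction, with hypothetical names (`is_known_path` and both constants are illustrative, not the PR's `_score_filename`):

```python
# Set as described in the finding: one entry left in uppercase.
KNOWN_PATHS_AS_REPORTED = {"maintainers.md", "SECURITY-INSIGHTS.md"}

# Normalizing the set once makes membership checks case-insensitive.
KNOWN_PATHS_NORMALIZED = {p.lower() for p in KNOWN_PATHS_AS_REPORTED}


def is_known_path(candidate_path: str, known_paths: set[str]) -> bool:
    """Lowercase the candidate before the membership test, as the finding describes."""
    return candidate_path.lower() in known_paths
```

Normalizing at definition time means a later contributor can add entries in any case without reintroducing the bug.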

cursor · 2026-03-11T17:24:13Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+            line[2:] if line.startswith("./") else line
+            for line in output.strip().split("\n")
+            if line.strip()
+        ]


Empty extension glob matches all files in listing

Medium Severity

VALID_EXTENSIONS includes "" (empty string), so _list_repo_files generates --iglob "*" which matches all files. Since multiple ripgrep --iglob include patterns use OR logic, this wildcard makes all other extension-specific patterns redundant. The function returns every file in the repo instead of filtering to document-like extensions. This degrades the Step 4 AI fallback: when no files score above zero, the first 300 files sent to AI will be arbitrary (likely source code) rather than text/document files.
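Why the empty extension defeats the filter can be shown without ripgrep. The sketch below uses Python's `fnmatch` only to approximate glob matching (it is not ripgrep's matcher), and the helper names and the trimmed extension set are assumptions. The alternative shown, checking the extension in Python so that extensionless files can be allowed explicitly, is one possible approach rather than the PR's fix.

```python
import fnmatch
import os

VALID_EXTENSIONS = {"", ".md", ".yaml"}  # trimmed; "" stands for extensionless files


def globs_from_extensions(exts: set[str]) -> list[str]:
    """Build one include glob per extension; the empty extension yields a bare '*'."""
    return sorted(f"*{ext}" for ext in exts)


def matches_any_glob(path: str, globs: list[str]) -> bool:
    """Include patterns combine with OR logic, so a single '*' admits every file."""
    name = os.path.basename(path).lower()
    return any(fnmatch.fnmatchcase(name, g) for g in globs)


def has_valid_extension(path: str, exts: set[str]) -> bool:
    """Compare the real extension instead, which keeps extensionless files without a wildcard."""
    return os.path.splitext(os.path.basename(path).lower())[1] in exts
```

One caveat of the extension check: `os.path.splitext` treats dotfiles such as `.gitignore` as extensionless, so they would pass unless filtered separately.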

cursor · 2026-03-11T17:24:13Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

        ai_cost = 0.0
        maintainers_found = 0
        maintainers_skipped = 0
+        candidate_files: list[str] = []


Type annotation mismatch for candidate_files variable

Low Severity

candidate_files is declared as list[str] but is later assigned maintainers.candidate_files which is list[tuple[str, int]] (path and score pairs). The annotation is misleading and inconsistent with MaintainerResult.candidate_files. This won't crash at runtime since Python doesn't enforce type hints, but it obscures the actual data shape written into the metrics dict.

mbani01 self-assigned this Mar 10, 2026

Copilot AI review requested due to automatic review settings March 10, 2026 15:41

Copilot started reviewing on behalf of mbani01 March 10, 2026 15:42

Copilot AI reviewed Mar 10, 2026

cursor bot reviewed Mar 10, 2026

mbani01 requested a review from joanagmaia March 10, 2026 16:42

mbani01 marked this pull request as draft March 10, 2026 17:54

mbani01 added 9 commits March 11, 2026 14:05

chore: install ripgrep

14a902a

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

feat: leverage maintainersFile from db before falling back to regular…

b597c99

… detection Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

feat: improve maintainers detection & analysis

5cb07fa

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

feat: track analyzed maintainers files in metrics

dd7d2c6

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

feat: change candidate file detection to be more narrow

1eb1483

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

fix: enable email fallback for identity lookup during maintainer update

b19c8b2

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

chore: avoid bulding ai prompt when full content if batching is required

98ea9ce

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

fix: remove duplicate rg pattern

cdfc93d

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

chore: add extra validation for reamde files to have maintainer keywo…

b4dd488

…rd in content Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

mbani01 force-pushed the feat/improve_maintainer_file_detection branch from bc8e3df to b4dd488 Compare March 11, 2026 14:05

mbani01 added 2 commits March 11, 2026 14:32

feat: improve ai fallback detection by passing scored candidates and …

3ae091f

…improve prompt Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

chore: limit candiate_files saved in db to 100

59cff58

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

mbani01 marked this pull request as ready for review March 11, 2026 14:34

mbani01 commented Mar 11, 2026

joanagmaia reviewed Mar 11, 2026

chore: add extra filename & stems

b38abdc

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

cursor bot reviewed Mar 11, 2026

4 participants
